We can build a basic network object from an edgelist. Recall that this data structure is a list of node pairs with an observed relationship, and it typically includes information on the magnitude of that relationship. For an introduction, we’ll start with data on cross-border banking flows between countries from 1978 to 2014.
These data are structured to report net surplus flows between two states, and thus only include directed relationships where net capital flows from one country’s banking sector to the other in a single year:
load("banking_edgelist.RData")
head(bank_edge)
## sender receiver year weight
## 1 Japan Ireland 1978 1.727399
## 2 Ireland Netherlands 1978 2.593537
## 3 Ireland Sweden 1978 1.013054
## 4 Switzerland Ireland 1978 2.831801
## 5 Ireland United Kingdom 1978 6.501687
## 6 Netherlands Japan 1978 5.626296
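An edgelist maps directly onto a (weighted, directed) adjacency matrix, which is useful to keep in mind when debugging. Here is a minimal base-R sketch using a tiny hypothetical edgelist with the same sender/receiver/weight shape as bank_edge (the names and values are made up):

```r
# Tiny hypothetical edgelist, same shape as bank_edge (minus year)
el <- data.frame(sender   = c("A","A","B"),
                 receiver = c("B","C","C"),
                 weight   = c(1.5, 2.0, 3.0),
                 stringsAsFactors = FALSE)
nodes <- sort(union(el$sender, el$receiver))
A <- matrix(0, length(nodes), length(nodes),
            dimnames = list(nodes, nodes))
A[cbind(el$sender, el$receiver)] <- el$weight  # rows send, columns receive
A  # A->B is 1.5, but B->A stays 0: direction matters
```

igraph can build the same network object from either representation (graph.data.frame for edgelists, graph.adjacency for matrices).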
The net surplus flow values are logged, and the edgelist reports each observed dyad yearly. Let’s pick a fun year (I don’t know, why not 1990?) and see where it takes us when we build the network object:
bank_edge_y<-bank_edge[which(bank_edge$year==1990),c(1:2,4)] #Subset edgelist to relevant year
bank_net<-graph.data.frame(bank_edge_y,directed=T) #Build network object from subset edgelist
bank_net
## IGRAPH c07799f DNW- 159 1375 --
## + attr: name (v/c), weight (e/n)
## + edges from c07799f (vertex names):
## [1] Japan ->Ireland Luxembourg ->Ireland
## [3] Ireland ->Netherlands Ireland ->Sweden
## [5] Switzerland ->Ireland Ireland ->United Kingdom
## [7] Ireland ->United States Luxembourg ->Japan
## [9] Japan ->Netherlands Japan ->Sweden
## [11] Switzerland ->Japan United Kingdom->Japan
## [13] Japan ->United States Luxembourg ->Netherlands
## [15] Luxembourg ->Sweden Switzerland ->Luxembourg
## + ... omitted several edges
Woo! We now have a network object in our environment reporting directed surplus flow relationships between states’ banking sectors for the year 1990. Here are a few basic commands for getting the number of nodes and edges, respectively, from a network object:
vcount(bank_net) #Vertex (node) count
## [1] 159
ecount(bank_net) #Edge count
## [1] 1375
What else can we do with this, you might ask? Many things! Let’s start by actually visualizing the network:
One of the best parts about network analysis is the ability to visualize social spaces. However, doing it right can be tricky. This section will introduce the basics of network plotting, from the default output to a customizable plotting function.
So anyone can take a network object and toss it into a plotting function…
plot(bank_net)
… but it doesn’t come out great. Ever heard the term ‘hairball plot’? This is what they mean. Without nicer plotting code, any network image will be quite ugly.
Let’s build a network plotting function to use throughout the script that produces much lovelier images. We’ll allow for inputs on the network, whether we want vertex size variance and labels, whether to visualize edge weight by width, layout specifications, and a title:
net_plot<-function(network,sizes=T,labels=F,weight_width=F,layout,title){
if(labels==T){vnames<-V(network)$name} #T: label with names
else {vnames<-rep(NA,vcount(network))} #F: no labels at all
if(weight_width==T){width<-log(E(network)$weight)+1} #T: diff edge widths
else {width<-0.2} #F: fixed (thin) edge widths
if(sizes==T){V(network)$size<-10*degree(network,normalized=T)}
else {V(network)$size<-5} #T: degree-weighted size; F: fixed (small) size
plot(network,vertex.label=vnames,edge.arrow.size=0.05,
edge.curved=seq(-0.5,0.5,length=ecount(network)),
edge.width=width,layout=layout,main=title)
}
net_plot(bank_net,layout=layout_with_fr(bank_net),
title="Banking Surplus Flows: 1990")
Now that’s much nicer than the first one, but still a little heavy. As in most large network objects, it can be helpful to clear out some of the smaller, or theoretically less relevant edges to get a closer (or different) look.
What if we want to only observe relationships with weights above a particular threshold? We call this thinning in network analysis. The following code (i) removes edges below a set weight threshold, (ii) deletes nodes which no longer have ties as a result of that thinning, and (iii) plots our newly thinned network:
threshold<-6 #Set threshold for edge deletion
#Remove edges with bracket and which() referencing by weight and threshold#
bank_net_thin<-delete.edges(bank_net,E(bank_net)[which(E(bank_net)$weight<threshold)])
#Remove nodes with bracket and which() reference by degree (we review this later)#
bank_net_thin<-delete.vertices(bank_net_thin,V(bank_net_thin)[which(degree(bank_net_thin)==0)])
net_plot(bank_net_thin,layout=layout_with_fr(bank_net_thin),
title="Thinned Banking Surplus Flows: 1990")
Maybe we’d like to highlight a particular (set of) node(s) in our visuals. We can do this by coloring in nodes, either by name or by type. The logic is the same, with regard to node referencing in a network object; let’s color in the US and UK:
name_col<-c("United States","United Kingdom") #Vector of names to highlight
#Assign color to specific nodes with bracket and which() referencing#
V(bank_net_thin)$color[V(bank_net_thin)$name %in% name_col]<-"purple"
net_plot(bank_net_thin,layout=layout_with_fr(bank_net_thin),
title="Thinned Banking Surplus Flows (US & UK): 1990")
Let’s also try highlighting nodes in the top 5% of brokerage (betweenness):
#Distributional threshold for betweenness in top 5%#
bet_thresh<-quantile(betweenness(bank_net_thin),0.95)
#Bracket and which() referencing on score and threshold to color#
V(bank_net_thin)$color[betweenness(bank_net_thin) > bet_thresh]<-"blue"
net_plot(bank_net_thin,layout=layout_with_fr(bank_net_thin),
title="Thinned Banking Surplus Flows (High Brokerage): 1990")
There is, of course, much more we can do with network plotting - we’ll get there. But this is only an image, after all, and it gives us little in the way of inference. What can we learn from this in terms of descriptive statistics?
There are two classes of descriptive statistics one can initially gather from a network. The first pertains to node-level attributes, such as measures of centrality. The second pertains to topological indicators about the network as a whole, with measures like density and centralization. We’ll briefly review each in the context of 1990 banking surplus flows.
Sometimes we want to know about an actor’s position within a network. Today we’ll cover several positional measures, starting with the most fundamental:
The most basic measure for nodes, and the foundation for most others, is degree centrality. This is, literally, just the number of relationships a node has in the network. It can be separated into in-degree (relationships pointing in) and out-degree (relationships pointing out) as well. Let’s quickly take a peek at these values in our network:
degDF<-data.frame(Name=V(bank_net)$name,
Deg_All=degree(bank_net),
Deg_In=degree(bank_net,mode="in"),
Deg_Out=degree(bank_net,mode="out"),
row.names = NULL)
head(degDF,15)
## Name Deg_All Deg_In Deg_Out
## 1 Japan 92 43 49
## 2 Luxembourg 139 85 54
## 3 Ireland 67 36 31
## 4 Switzerland 151 106 45
## 5 United Kingdom 157 100 57
## 6 Sweden 108 72 36
## 7 Netherlands 44 21 23
## 8 United States 109 64 45
## 9 Belgium 145 92 53
## 10 Germany 127 48 79
## 11 Denmark 93 54 39
## 12 France 147 89 58
## 13 Finland 73 44 29
## 14 Ghana 7 6 1
## 15 Greece 13 5 8
hist(degDF$Deg_All,breaks="FD",xlab="Degree",
main="Degree Distribution: 1990 Banking Network")
What wonders! We can get a census of this value for all nodes in the network with a snap of our keyboard-typing fingers. Pure magic. As you can see from the histogram, degree roughly follows a power-law distribution, which is common across a broad range of real-world networks (see: Barabási and preferential attachment for higher-level explorations of this dynamic).
Maybe raw scores don’t mean much to us, though, and we’d rather see the normalized values. This simply divides the degree score by the maximum possible degree, \(n-1\) - it’s thus perfectly collinear with the raw score but gives us a different sense of centrality:
degDFn<-data.frame(Name=V(bank_net)$name,
Deg_All=degree(bank_net,normalized=T),
Deg_In=degree(bank_net,mode="in",normalized=T),
Deg_Out=degree(bank_net,mode="out",normalized=T),
row.names = NULL)
head(degDFn,15)
## Name Deg_All Deg_In Deg_Out
## 1 Japan 0.58227848 0.27215190 0.310126582
## 2 Luxembourg 0.87974684 0.53797468 0.341772152
## 3 Ireland 0.42405063 0.22784810 0.196202532
## 4 Switzerland 0.95569620 0.67088608 0.284810127
## 5 United Kingdom 0.99367089 0.63291139 0.360759494
## 6 Sweden 0.68354430 0.45569620 0.227848101
## 7 Netherlands 0.27848101 0.13291139 0.145569620
## 8 United States 0.68987342 0.40506329 0.284810127
## 9 Belgium 0.91772152 0.58227848 0.335443038
## 10 Germany 0.80379747 0.30379747 0.500000000
## 11 Denmark 0.58860759 0.34177215 0.246835443
## 12 France 0.93037975 0.56329114 0.367088608
## 13 Finland 0.46202532 0.27848101 0.183544304
## 14 Ghana 0.04430380 0.03797468 0.006329114
## 15 Greece 0.08227848 0.03164557 0.050632911
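A quick way to see what the normalization is doing: divide raw degree by the \(n-1\) possible partners. A base-R check on a small hypothetical undirected graph (not the banking data):

```r
# Hypothetical 4-node undirected adjacency matrix
A <- matrix(c(0,1,1,1,
              1,0,1,0,
              1,1,0,0,
              1,0,0,0), nrow=4, byrow=TRUE)
deg  <- rowSums(A)        # raw degree: 3 2 2 1
degn <- deg/(nrow(A)-1)   # normalized: each score over n-1 possible ties
```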
Beyond degree, there are a few more measures you may be interested in: eigenvector centrality (connections to well-connected others), Burt’s constraint (the redundancy of a node’s ties), and betweenness (how often a node sits on shortest paths between others).
Let’s collect those all and take a quick peek at them:
degDFo<-data.frame(Name=V(bank_net)$name,
EV_Cent=evcent(bank_net)$vector,
Constraint=constraint(bank_net),
BetweenN=betweenness(bank_net),
BetweenP=betweenness(bank_net,normalized=T),
row.names=NULL)
head(degDFo,15)
## Name EV_Cent Constraint BetweenN BetweenP
## 1 Japan 0.81675078 0.09810644 168 0.006772555
## 2 Luxembourg 0.72374080 0.09296226 5904 0.238006934
## 3 Ireland 0.39409131 0.11233969 4234 0.170684512
## 4 Switzerland 0.84876508 0.09501655 4193 0.169031686
## 5 United Kingdom 1.00000000 0.08451907 2653 0.106949931
## 6 Sweden 0.64247962 0.10314861 3726 0.150205595
## 7 Netherlands 0.57923615 0.11212721 56 0.002257518
## 8 United States 0.81783930 0.09432963 352 0.014190115
## 9 Belgium 0.74612284 0.09187300 6840 0.275739740
## 10 Germany 0.80390338 0.09532285 3202 0.129081674
## 11 Denmark 0.55281617 0.10639281 4668 0.188180279
## 12 France 0.88854257 0.09244901 2552 0.102878336
## 13 Finland 0.51544672 0.10783898 2966 0.119567846
## 14 Ghana 0.07080311 0.21292661 0 0.000000000
## 15 Greece 0.24985509 0.12171415 0 0.000000000
As you may be thinking, these are typically highly correlated indicators:
# install.packages("corrplot")
library(corrplot)
degDF<-cbind(degDFn[,2:4],degDFo[,2:5])
corrplot(cor(degDF),method="circle")
round(cor(degDF),2)
## Deg_All Deg_In Deg_Out EV_Cent Constraint BetweenN BetweenP
## Deg_All 1.00 0.98 0.96 0.91 -0.28 0.79 0.79
## Deg_In 0.98 1.00 0.89 0.88 -0.25 0.80 0.80
## Deg_Out 0.96 0.89 1.00 0.91 -0.31 0.72 0.72
## EV_Cent 0.91 0.88 0.91 1.00 -0.44 0.59 0.59
## Constraint -0.28 -0.25 -0.31 -0.44 1.00 -0.13 -0.13
## BetweenN 0.79 0.80 0.72 0.59 -0.13 1.00 1.00
## BetweenP 0.79 0.80 0.72 0.59 -0.13 1.00 1.00
These are the basic node-level descriptive statistics you can use as building blocks for applied network analysis in your projects. Importantly, you should not plug all of them in as covariates (that was a bit of a hint-cough with the correlation table). Rather, choose the one(s) which most closely align with your theoretical ideas about why network position matters (is it centrality? brokerage? constraint?) and use that in your statistical models (with caution) to draw inferences.
Let’s explore a few topology measures of networks. Unlike node-level measures, which offer information on nodes’ positions within the broader network, these indicators describe features of the network as a whole. In this brief introduction we’ll review three relevant measures: density, centralization, and average path length.
Density tells us the proportion of ties observed in a network relative to the total number possible. In an undirected network, that total is given by \(\frac{n(n-1)}{2}\); in a directed network it is simply \(n(n-1)\), where \(n\) is the number of nodes. In a network with 200 nodes, for example, the total possible number of undirected ties is 19,900. Quite often, the propensity for ties in a network is theoretically informative; in our banking network it can give us a sense of how globalized cross-border capital flows are in a given year.
graph.density(bank_net)
## [1] 0.0547329
As we can see, this is quite low; only about 5.5% of possible ties are observed in 1990. Notice we can get the same measure manually:
ecount(bank_net)/(vcount(bank_net)*(vcount(bank_net)-1))
## [1] 0.0547329
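The denominator arithmetic is worth internalizing; here it is in base R for the 200-node example above and for our 1990 network:

```r
n <- 200
n*(n-1)/2        # undirected possibilities: 19900
n*(n-1)          # directed possibilities: 39800
1375/(159*158)   # 1990 banking network: ~0.0547, matching graph.density()
```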
Centralization, in contrast, tells us about the extent of inequality in node-level measures. The most common metric is degree centralization; it ranges from 0 (where all nodes’ degree values are the same) to 1 (where there is perfect inequality among scores).
centr_degree(bank_net)$centralization
## [1] 0.4449007
As we can see, there is pretty stark inequality here (this was already somewhat visible in the degree distribution we explored earlier). Importantly, though, you can get centralization for other node-level measures, like betweenness.
centr_betw(bank_net)$centralization
## [1] 0.1522049
Inequality in this score is much lower than in degree, which tells us a bit about the relationship between the two distributions in our observed network.
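To build intuition for the 0-to-1 scale, here is Freeman’s degree centralization computed by hand for a 4-node undirected star - the maximally unequal configuration - which scores exactly 1 (igraph’s centr_degree applies analogous normalizations for directed cases):

```r
deg <- c(3, 1, 1, 1)                         # hub plus three spokes
n <- length(deg)
centr <- sum(max(deg) - deg)/((n-1)*(n-2))   # Freeman's formula
centr  # 1: maximal inequality
```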
In many cases, this relationship hinges on another feature of networks: the average path length between any two nodes. Remember that old ‘six degrees of separation’ idea? This is where it came from. Average path length tells us the mean distance across all pairs of nodes, even those which are not directly connected.
average.path.length(bank_net)
## [1] 2.143005
For cross-border banking, it’s only about two hops from one country to another - this helps explain why degree inequality may be so high when betweenness inequality is so low; nodes are generally very close to each other, and some are simply more connected than others. Importantly, you can also calculate pair-specific distances in a full \(n \times n\) matrix (note that distances() uses edge weights by default, which is why the values below are not integers). This can be helpful for actor-specific distance considerations against the broader average (are some closer to, or farther from, others on average?):
dist_mat<-distances(bank_net)
(dist_mat[1:5,1:5])
## Japan Luxembourg Ireland Switzerland United Kingdom
## Japan 0.000000 1.1947334 1.7914792 1.2692943 1.0262935
## Luxembourg 1.194733 0.0000000 0.5967459 0.1616604 0.3204588
## Ireland 1.791479 0.5967459 0.0000000 0.7584063 0.9129638
## Switzerland 1.269294 0.1616604 0.7584063 0.0000000 0.2430007
## United Kingdom 1.026294 0.3204588 0.9129638 0.2430007 0.0000000
mean(dist_mat[5,])
## [1] 1.302646
As we can gather just from the UK row, it is substantially closer to other countries in the cross-border banking network than the average. Whoever thought being a global financial center could be an empirical question!
These are interesting and fun for 1990, but maybe we’d like to see how these measures change over time. This can be easily operationalized within a for-loop over the years of data for which you have network observations:
years<-min(bank_edge$year):max(bank_edge$year)
topog_df<-data.frame(Year=years,Density=NA,DegCent=NA,BetwCent=NA,AvgPath=NA)
for(y in years){
net<-graph.data.frame(bank_edge[which(bank_edge$year==y),c(1:2,4)])
topog_df[which(topog_df$Year==y),2:5]<-c(graph.density(net),
centr_degree(net)$centralization,
centr_betw(net)$centralization,
average.path.length(net))
}
par(mfrow=c(2,2))
plot(topog_df$Year,topog_df$Density,"l",col="blue",xlab="Year",ylab="Density")
plot(topog_df$Year,topog_df$DegCent,"l",col="red",xlab="Year",ylab="Deg. Centr.")
plot(topog_df$Year,topog_df$BetwCent,"l",col="green",xlab="Year",ylab="Betw. Centr.")
plot(topog_df$Year,topog_df$AvgPath,"l",col="purple",xlab="Year",ylab="Avg. Path")
How delicious! We can clearly see some trends are in inherent tension, like density and centralization indicators. We can also tell that average path length drops as density increases, which follows intuitive logic about graph connectedness. As you can imagine, there are a broad number of other things you can do with these indicators for both descriptive and inferential applications. Given our limited time, it’s best we move forward into another, possibly more relevant network, to explore community detection in networks.
It is important to acknowledge that network analysis is not well-suited for causal inference. This is a strength, not a weakness, of the approach. However, this does not mean that some broader logics of causal inference have not been folded into network analytic techniques. These rely on an older idea in network analysis: we can glean topological information about observed networks by comparing them to purely random networks with similar dimensions. This idea was developed in a paper by Erdős and Rényi, and serves as the foundation of virtually all network statistical techniques (for extensions, see work by Barabási on why this may not be a solid foundation for inference).
For a very, very basic introduction, we will review two techniques in different packages: conditional uniform graph (CUG) tests in sna, and exponential random graph models (ERGMs) in statnet.
This is basically a bootstrap approach for testing graph-level indicators against random-graph expectations. We have to switch to another package, sna, which has stronger commands for this line of analysis. Let’s test on a parameter we haven’t reviewed yet, transitivity, which measures the degree of triadic closure in a network (the proportion of closed triangles). In igraph, the command is simply ‘transitivity(network)’. When we run a CUG test, we condition on some measure of network topology, which tells the bootstrapping routine how to construct the random networks. For this first run, we’ll condition on size (number of nodes).
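Before running the test, it may help to see what transitivity actually computes. A hand computation on a hypothetical toy graph (a triangle with one pendant node), which matches what igraph’s transitivity() would return:

```r
# Triangle 1-2-3 plus pendant node 4: adjacency built by hand
A <- matrix(0, 4, 4)
A[rbind(c(1,2), c(2,3), c(1,3), c(3,4))] <- 1
A <- A + t(A)                      # symmetrize to a binary undirected graph
A2 <- A %*% A; A3 <- A2 %*% A
# closed 2-paths over all 2-paths between distinct nodes
trans <- sum(diag(A3)) / (sum(A2) - sum(diag(A2)))
trans  # 0.6: three of the five connected triples are closed
```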
# install.packages("sna")
library(sna)
cug1<-cug.test(sa1,sna::gtrans,reps=1000,
mode="graph",cmode="size")
cug1
##
## Univariate Conditional Uniform Graph Test
##
## Conditioning Method: size
## Graph Type: graph
## Diagonal Used: FALSE
## Replications: 1000
##
## Observed Value: 0.8224027
## Pr(X>=Obs): 0
## Pr(X<=Obs): 1
plot(cug1)
Notice that our student network is significantly more transitive than a random graph would suggest. Also notice that this is somewhat trivial, given that the nulls all hover around 50% triadic closure under this conditioning. You can also condition on edge count (cmode=“edges”) or the dyad census (cmode=“dyad.census”), and test other topology indicators from sna, like centralization.
Exponential random graph models are a much more recent advancement in network statistics. These are basically regression models which treat the entire network as the unit of analysis, and predict the likelihood of ties as a function of node-, edge-, and graph-level covariates. Let’s run a super-basic example on our student network with only an edges term (which functionally serves as an intercept in this logic):
# install.packages("statnet")
library(statnet)
net<-network(sa1,directed=F)
m1<-ergm(net ~ edges)
summary(m1)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: net ~ edges
##
## Iterations: 4 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges 0.26276 0.02559 0 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 8617 on 6216 degrees of freedom
## Residual Deviance: 8511 on 6215 degrees of freedom
##
## AIC: 8513 BIC: 8520 (Smaller is better.)
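One nice property of the edges-only model: the inverse logit of the coefficient recovers the observed density, i.e., the unconditional probability of a tie. A quick base-R check on the estimate above:

```r
plogis(0.26276)  # ~0.565: over half of all possible ties are present
```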
We can add a number of other interesting network covariates, like node attributes. For example, maybe we’re interested in the propensity for ties based on nodes’ subfield memberships:
set.vertex.attribute(net,"Comparative",subfield_adjacency[,1])
set.vertex.attribute(net,"Theory",subfield_adjacency[,2])
set.vertex.attribute(net,"IR",subfield_adjacency[,3])
set.vertex.attribute(net,"American",subfield_adjacency[,4])
set.vertex.attribute(net,"Methods",subfield_adjacency[,5])
set.vertex.attribute(net,"SFCount",rowSums(subfield_adjacency))
m2<-ergm(net ~ edges + nodecov("Comparative") +
nodecov("Theory") + nodecov("IR") +
nodecov("American") + nodecov("Methods") +
nodecov("SFCount"))
summary(m2)
##
## ==========================
## Summary of model fit
## ==========================
##
## Formula: net ~ edges + nodecov("Comparative") + nodecov("Theory") + nodecov("IR") +
## nodecov("American") + nodecov("Methods") + nodecov("SFCount")
##
## Iterations: 6 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % p-value
## edges -3.94558 0.24751 0 < 1e-04 ***
## nodecov.Comparative 1.94036 0.07622 0 < 1e-04 ***
## nodecov.Theory 0.13691 0.07402 0 0.064420 .
## nodecov.IR 0.21897 0.05955 0 0.000238 ***
## nodecov.American 0.47627 0.08598 0 < 1e-04 ***
## nodecov.Methods 0.70952 0.05508 0 < 1e-04 ***
## nodecov.SFCount 0.23177 0.04825 0 < 1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 8617 on 6216 degrees of freedom
## Residual Deviance: 6657 on 6209 degrees of freedom
##
## AIC: 6671 BIC: 6718 (Smaller is better.)
There are many, many, many parameters you can incorporate into ERGMs. We are not going to review those here, for two reasons. First, the majority of structural parameters (like edgewise shared partners) take a while to optimize via Monte Carlo MLE, especially with large networks. Second, and on a broader point, these models very often fail to converge.
As such, this introduction is more of a cautionary one; most extensions of the ERGM family make strong assumptions about network structure that should be carefully checked with descriptive statistics before being incorporated into these models for inference. However, even node-level attributes like these can be informative for thinking about linkage propensity.
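Even so, the nodecov estimates in m2 are interpretable on the familiar log-odds scale; exponentiating gives an odds multiplier, just as in logistic regression. For example, using the Comparative estimate from the output above:

```r
exp(1.94036)  # ~6.96: each unit increase in a dyad's summed Comparative
              # scores multiplies the odds of a tie by about 7
```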